INTRODUCTION

The goal of this machine learning project is to predict the genre of roughly 50,000 tracks obtained from the Spotify API, via Kaggle, as one of the following:
‘Electronic’, ‘Anime’, ‘Jazz’, ‘Alternative’, ‘Country’, ‘Rap’, ‘Blues’, ‘Rock’, ‘Classical’, or ‘Hip-Hop’.

Spotify is a music streaming platform with 406 million monthly users. Here is their “About Us” page for some more info:
[1]: https://newsroom.spotify.com/company-info/

According to the Oxford Dictionary, a genre is a category of artistic composition, characterized by similarities in form, style, or subject matter. With this project, I hope to gain insight into the question of whether genre is inherent to the nature of music, or if it is a product of human nature and our tendency to look for patterns in the world around us. So, we begin…

Starting Off: Loading Data and Taking a Look

For a definition of the numerous variables discussed throughout this project, reference my attached codebook.

#reading in data file
music<-read.csv("/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/data/music_genre.csv")
glimpse(music) #taking a look at the initial dataset
## Rows: 50,005
## Columns: 18
## $ instance_id      <dbl> 32894, 46652, 30097, 62177, 24907, 89064, 43760, 3073…
## $ artist_name      <chr> "Röyksopp", "Thievery Corporation", "Dillon Francis",…
## $ track_name       <chr> "Röyksopp's Night Out", "The Shining Path", "Hurrican…
## $ popularity       <dbl> 27, 31, 28, 34, 32, 47, 46, 43, 39, 22, 30, 27, 31, 3…
## $ acousticness     <dbl> 4.68e-03, 1.27e-02, 3.06e-03, 2.54e-02, 4.65e-03, 5.2…
## $ danceability     <dbl> 0.652, 0.622, 0.620, 0.774, 0.638, 0.755, 0.572, 0.80…
## $ duration_ms      <dbl> -1, 218293, 215613, 166875, 222369, 519468, 214408, 4…
## $ energy           <dbl> 0.941, 0.890, 0.755, 0.700, 0.587, 0.731, 0.803, 0.70…
## $ instrumentalness <dbl> 7.92e-01, 9.50e-01, 1.18e-02, 2.53e-03, 9.09e-01, 8.5…
## $ key              <chr> "A#", "D", "G#", "C#", "F#", "D", "B", "G", "F", "A",…
## $ liveness         <dbl> 0.1150, 0.1240, 0.5340, 0.1570, 0.1570, 0.2160, 0.106…
## $ loudness         <dbl> -5.201, -7.043, -4.617, -4.498, -6.266, -10.517, -4.2…
## $ mode             <chr> "Minor", "Minor", "Major", "Major", "Major", "Minor",…
## $ speechiness      <dbl> 0.0748, 0.0300, 0.0345, 0.2390, 0.0413, 0.0412, 0.351…
## $ tempo            <chr> "100.889", "115.00200000000001", "127.994", "128.014"…
## $ obtained_date    <chr> "4-Apr", "4-Apr", "4-Apr", "4-Apr", "4-Apr", "4-Apr",…
## $ valence          <dbl> 0.7590, 0.5310, 0.3330, 0.2700, 0.3230, 0.6140, 0.230…
## $ music_genre      <chr> "Electronic", "Electronic", "Electronic", "Electronic…
n_distinct(music$music_genre) #how many genres are there?
## [1] 11

Right off the bat, I notice some things about this dataset. First, there are 11 distinct genres present, but only 10 listed on the Kaggle codebook from which I obtained this data; maybe there are some null values present. Second, most of the features are numerical, besides the names of tracks and artists, obtained date, key, and mode. Oddly, tempo is recorded as a character variable even though its entries look numerical, so this is something I will have to address later in the project. Key and mode are naturally categorical, so they will have to be handled via dummy or one-hot encoding later on. Finally, not all of the above variables will be useful in identifying the genre of a given track, so I will either remove them or replace them with characteristics derived from them.

To take a closer look at the distribution of each numerical variable:

summary(music) # to see distributions of the numerical variables
##   instance_id    artist_name         track_name          popularity   
##  Min.   :20002   Length:50005       Length:50005       Min.   : 0.00  
##  1st Qu.:37974   Class :character   Class :character   1st Qu.:34.00  
##  Median :55914   Mode  :character   Mode  :character   Median :45.00  
##  Mean   :55888                                         Mean   :44.22  
##  3rd Qu.:73863                                         3rd Qu.:56.00  
##  Max.   :91759                                         Max.   :99.00  
##  NA's   :5                                             NA's   :5      
##   acousticness     danceability     duration_ms          energy        
##  Min.   :0.0000   Min.   :0.0596   Min.   :     -1   Min.   :0.000792  
##  1st Qu.:0.0200   1st Qu.:0.4420   1st Qu.: 174800   1st Qu.:0.433000  
##  Median :0.1440   Median :0.5680   Median : 219281   Median :0.643000  
##  Mean   :0.3064   Mean   :0.5582   Mean   : 221253   Mean   :0.599755  
##  3rd Qu.:0.5520   3rd Qu.:0.6870   3rd Qu.: 268612   3rd Qu.:0.815000  
##  Max.   :0.9960   Max.   :0.9860   Max.   :4830606   Max.   :0.999000  
##  NA's   :5        NA's   :5        NA's   :5         NA's   :5         
##  instrumentalness       key               liveness          loudness      
##  Min.   :0.000000   Length:50005       Min.   :0.00967   Min.   :-47.046  
##  1st Qu.:0.000000   Class :character   1st Qu.:0.09690   1st Qu.:-10.860  
##  Median :0.000158   Mode  :character   Median :0.12600   Median : -7.277  
##  Mean   :0.181601                      Mean   :0.19390   Mean   : -9.134  
##  3rd Qu.:0.155000                      3rd Qu.:0.24400   3rd Qu.: -5.173  
##  Max.   :0.996000                      Max.   :1.00000   Max.   :  3.744  
##  NA's   :5                             NA's   :5         NA's   :5        
##      mode            speechiness         tempo           obtained_date     
##  Length:50005       Min.   :0.02230   Length:50005       Length:50005      
##  Class :character   1st Qu.:0.03610   Class :character   Class :character  
##  Mode  :character   Median :0.04890   Mode  :character   Mode  :character  
##                     Mean   :0.09359                                        
##                     3rd Qu.:0.09853                                        
##                     Max.   :0.94200                                        
##                     NA's   :5                                              
##     valence       music_genre       
##  Min.   :0.0000   Length:50005      
##  1st Qu.:0.2570   Class :character  
##  Median :0.4480   Mode  :character  
##  Mean   :0.4563                     
##  3rd Qu.:0.6480                     
##  Max.   :0.9920                     
##  NA's   :5

At a glance, I see that most of the variables are recorded in the range of 0-1, besides popularity, duration, and loudness. This makes sense: popularity is on a 0-100 scale, duration is measured in milliseconds, and loudness is measured in decibels. Instance ID seems to be some sort of index, so there's no real meaning to its distribution.

Onto data cleaning!

DATA CLEANING

n_distinct(music$obtained_date) #unique dates
## [1] 6

Right off the bat, I can determine that obtained date won't be helpful in predicting genre, as it has nothing to do with the individual tracks but rather with when the data was recorded from Spotify; thus, I will go ahead and remove it before any of my EDA, so as not to distract from my analysis. Instance ID is just an index, so it will be removed as well:

#remove obtained date, instance id
music <- music %>%
  select(-obtained_date, -instance_id)

I will begin by checking for NA, or missing, values in the dataset:

music[rowSums(is.na(music)) > 0, ] #which rows contain missing values?

I see 5 observations that are filled with null values. Since these are so few, I will just go ahead and remove them:

music <- music %>%
  drop_na() #remove null values
count(music) #how many observations left?
head(music) #first 6 observations of the dataset

That leaves us with 50000 observations, or individual tracks, to explore, the first 6 of which are displayed above.

Time to deal with the mysterious character-typed tempo, by simply converting it to a double as it should be (any non-numeric entries will be coerced to NA, to be handled later):

music$tempo <- as.double(as.character(music$tempo))

That is as much data cleaning as I can conduct initially. Perhaps the EDA will reveal more issues with the data that need to be sorted out (spoiler: it does).

EDA (Exploratory Data Analysis)

I will examine each variable’s distribution by genre, as well as compare some variables against each other.

GENRE

music %>%
  ggplot(aes(music_genre, fill=music_genre)) + #color by genre
  geom_bar() + #bar plot of genre counts
  labs(title = "Distribution of Genre" )

I can see that the data is perfectly balanced (i.e., it has an equal number of tracks per genre). Upon further research, I learned that this means accuracy will be a good metric for judging the goodness of fit of the model later on.
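As a quick sanity check on why balance matters, here is a sketch with hypothetical toy labels (not the real dataset): with 10 equally sized classes, a constant guesser only reaches the no-information rate of 1/10, so any accuracy well above 10% reflects real signal.

```r
# Toy illustration (hypothetical labels, not the Spotify data): with 10
# perfectly balanced classes, always guessing the same class yields an
# accuracy equal to the no-information rate of 1/10.
genres <- rep(paste0("genre_", 1:10), each = 100)  # 10 classes x 100 tracks
guess  <- rep("genre_1", length(genres))           # constant guesser
baseline_accuracy <- mean(guess == genres)
baseline_accuracy  # 0.1 -- a useful model must beat this comfortably
```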

ARTIST NAME

n_distinct(music$artist_name) #how many unique artists?
## [1] 6863
n_distinct(music$track_name) #how many unique songs?
## [1] 41699
#some artist names are marked as "empty_field"; how many?
sum(music$artist_name=="empty_field")
## [1] 2489
music %>%
  filter(str_detect(artist_name, "empty_field")) %>%
  head()

There are 6863 unique artists present in the data, 41699 unique tracks (meaning there must be some duplicate tracks), and 2489 artist names marked as “empty_field”. I won’t remove these observations, because they might still be useful in helping the model predict the genre of other tracks.

In order to make artist names a bit more useful to the model, I will instead use the length of each name (i.e., how many characters are present in the string) to glean some information about its relation to genre. I will store the lengths in a new column in the dataset:

music$Aname_length = str_length(music$artist_name) #new column of artist name lengths
head(music$Aname_length) #first 6 observations
## [1]  8 20 14  8 11 10

Let’s look at the distribution of artist name length by genre.

music %>%
  ggplot(aes(x=music_genre, y=Aname_length, fill=music_genre)) + #color by genre
  geom_boxplot() + #boxplots
  labs(title = "Length of Artist Name by Genre", x="Genre", y="Artist Name Length" )

Generally, classical music artists tend to have longer names, while the rest of the genres are quite similarly distributed. This could be useful.

TRACK NAME

Looking at length of track names and creating a new column in the data to store them:

music$Tname_length = str_length(music$track_name) #new column of track name lengths
head(music$Tname_length) #first 6 observations
## [1] 20 16  9  5 16  5

and once again their distributions by genre:

music %>%
  ggplot(aes(x=music_genre, y=Tname_length, fill=music_genre)) + #color by genre
  geom_boxplot() + #boxplots
  labs(title = "Length of Track Name by Genre", x="Genre", y="Track Name Length" )

Here, there is a much more pronounced difference than in lengths of artist names. Generally, name length of classical tracks is greater than any other genre. I’ll keep this in mind.

POPULARITY

#distribution of popularity by genre
music %>%
  ggplot(aes(reorder(music_genre, popularity, sum), y=popularity, fill=music_genre)) + 
  geom_col() + #barplot
  labs(title = "Distribution of Popularity by Genre", x="Genre", y="Popularity (Totaled)")

Rap, rock, and hip-hop are the most popular genres, while anime and classical are the least popular, with the rest sitting somewhere in the middle. (Since every genre contains the same number of tracks, these totals are directly comparable.)

ACOUSTICNESS

music %>%
  ggplot(aes(x=music_genre, y=acousticness, fill=music_genre)) + #color by genre
  geom_boxplot() + #boxplots
  labs(title = "Acousticness by Genre", x="Genre", y="Acousticness" )

Classical is once again an outlier to the rest of the data, as is jazz (this makes sense, as classical and jazz music typically rely on acoustic instruments, with little electronic production). All other genres seem similarly distributed as lower in acousticness. NOTE: the rap and hip-hop distributions consistently track each other, which makes sense because the genres are so closely related. I suspect acousticness is correlated with energy, and I will explore this when I examine the energy predictor later on.

DANCEABILITY

music %>%
  ggplot(aes(x=music_genre, y=danceability, fill=music_genre)) + #color by genre
  geom_boxplot() + #boxplots
  labs(title = "Danceability by Genre", x="Genre", y="Danceability" )

Classical is noticeably the least danceable genre, whereas hip-hop and rap are the most danceable, and all other genres are nearly the same.

DURATION

summary(music$duration_ms) #numeric distribution of duration in milliseconds
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      -1  174800  219281  221253  268612 4830606

There is an issue here. -1 is not a valid measurement of time, so there must be missing values. Let’s take a closer look:

sum(music$duration_ms == -1) #count invalid durations
## [1] 4939

This is a large number of observations with missing or invalid values of duration. I will fill these missing values with the median of the duration data, so as not to lose the duration variable by having to remove it:

music <- music %>% 
  mutate(duration_ms = ifelse(duration_ms==-1, #fill missing values with median
                            median(duration_ms, na.rm = T),
                            duration_ms))
sum(music$duration_ms==-1)
## [1] 0

Now that I have filled in the missing values, let’s look at the distribution by genre:

music %>%
  ggplot(aes(x=music_genre, y=duration_ms, fill=music_genre)) + #color by genre
  geom_boxplot() + #boxplots
  labs(title = "Track Duration by Genre", x="Genre", y="Duration in Milliseconds" )

There is an extreme outlier present in the electronic genre, as well as outliers in the classical and blues genres, but since the medians of each genre are similarly distributed, I’ll ignore the outliers. Overall, it seems the duration of classical tracks tends to be slightly longer than that of any other genre.

INSTRUMENTALNESS

summary(music$instrumentalness)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000158 0.181601 0.155000 0.996000

The minimum and 1st quartile are both 0, and the median is nearly 0, which indicates an issue in the data. A plot should give us a better look:

music %>%
  ggplot(aes(instrumentalness)) + #distribution of instrumentalness
  geom_histogram(fill="#00A36C") +
  labs(title = "Distribution of Instrumentalness", x="Instrumentalness")

sum(music$instrumentalness==0.0)
## [1] 15001

It seems a large portion (15001 of 50000 observations, or 30%) of the instrumentalness observations equal 0. This is indicative of missing values rather than actual data points, and that is too many missing values to deal with by replacing them with the mean or median, so I will drop instrumentalness entirely from the dataset, rather than use it to build my models.

music <- music %>%
  select(-instrumentalness)

ENERGY

music %>%
  ggplot(aes(x=music_genre, y=energy, fill=music_genre)) + #color by genre
  geom_boxplot() + #boxplots
  labs(title = "Energy by Genre", x="Genre", y="Energy" )

Classical continues to stand apart from the rest of the genres, here in that it tends to be much less energetic, when compared to other genres.

Energy logically seems to correlate with certain variables: acousticness, liveness, loudness, and tempo. Instead of examining each of these variables individually, I will plot energy against each of them and separate the results by genre:

ACOUSTICNESS

music %>%
  ggplot(aes(x=energy, y=acousticness, color=music_genre)) + #color by genre
  geom_point(alpha=0.05) + #scatterplot
  facet_wrap(~music_genre, scales = "free") + #separate graphs by genre
  geom_smooth(se = FALSE, color = "black", size = 1) + #add curved line
  theme(legend.position="none") + #remove legend (it was not useful)
  labs(title = "Energy vs Acousticness by Genre", x="Energy", y="Acousticness")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

I see a strong negative correlation, meaning the more energetic a song, the less acoustic it is, which is the opposite of what I originally believed.

LIVENESS

Now to compare liveness and genre:

music %>%
  ggplot(aes(x=energy, y=liveness, color=music_genre)) + #color by genre
  geom_point(alpha=0.05) + #scatterplot
  facet_wrap(~music_genre, scales = "free") + #separate graphs by genre
  geom_smooth(se = FALSE, color = "black", size = 1) + #add curved line
  theme(legend.position="none") + #remove legend
  labs(title = "Energy vs Liveness by Genre", x="Energy", y="Liveness")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

There is very little correlation between these two, which surprises me, as liveness seems to suggest something energetic. This leads me to believe that liveness is instead a measure of how “live” a track is, as in performed live rather than simply recorded in a studio. I will make note of this in my codebook.

LOUDNESS

music %>%
  ggplot(aes(x=energy, y=loudness, color=music_genre)) + #color by genre
  geom_point(alpha=0.05) + #scatterplot
  facet_wrap(~music_genre, scales = "free") + #separate graphs by genre
  geom_smooth(se = FALSE, color = "black", size = 1) + #add curved line
  theme(legend.position="none") + #remove legend
  labs(title = "Energy vs Loudness by Genre", x="Energy", y="Loudness")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Here, I see a strong positive correlation. This makes complete sense, as loudness is a measure of sound, and sound, of course, is a form of energy.

TEMPO

Converting tempo to a numeric type earlier coerced its non-numeric entries to NA, so there should be some missing values for the tempo predictor:

sum(is.na(music$tempo))
## [1] 4980

Let’s replace the missing values of tempo with the median of the tempo data first:

music <- music %>% 
  mutate(tempo = ifelse(is.na(tempo), #fill missing values with median
                            median(tempo, na.rm = T),
                            tempo))
sum(is.na(music$tempo))
## [1] 0

Now to plot tempo against energy by genre:

music %>%
  ggplot(aes(x=energy, y=tempo, color=music_genre)) + #color by genre
  geom_point(alpha=0.05) + #scatterplot
  facet_wrap(~music_genre, scales = "free") + #separate graphs by genre
  geom_smooth(se = FALSE, color = "black", size = 1) + #add curved line
  theme(legend.position="none") + #remove legend
  labs(title = "Energy vs Tempo by Genre", x="Energy", y="Tempo")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

There is very little correlation between tempo and energy, so I was wrong in assuming they were correlated.

KEY

What keys of music are present in the dataset?

unique(music$key) #what distinct keys are there in the data
##  [1] "A#" "D"  "G#" "C#" "F#" "B"  "G"  "F"  "A"  "C"  "E"  "D#"

Because key is categorical, I will opt for barplots separated by genre to examine the distribution:

music %>%
  ggplot(aes(x=key, fill=key)) + #color by key
  geom_bar() + #barplots
  facet_wrap(~music_genre, scales = "free") + #separate by genre for readability
  labs(title = "Key Distribution by Genre", x="Key")

Each genre looks to have a distinct spread of keys, which means this will be a helpful variable in building the model.

Because key is a categorical variable, I will have to create dummy variables via One Hot Encoding to use it in my model. One Hot Encoding splits a factor into one column per level, and uses 1s and 0s to indicate whether or not a track falls under that key.

music$key <- as.factor(music$key) #convert from character to factor
music <- one_hot(as.data.table(music)) #one hot encode (mltools + data.table)
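To illustrate what this produces, here is a minimal base-R sketch on hypothetical toy data (the project itself uses mltools’ one_hot): model.matrix with the intercept removed builds one 0/1 indicator column per factor level.

```r
# Toy illustration of one-hot encoding with base R (hypothetical keys,
# not the full dataset): model.matrix(~ key - 1) creates one 0/1
# indicator column per factor level.
toy <- data.frame(key = factor(c("A", "G", "A", "B")))
encoded <- as.data.frame(model.matrix(~ key - 1, data = toy))
encoded  # columns keyA, keyB, keyG; exactly one 1 per row
```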

MODE

There are only two modes which music can fall under:

unique(music$mode)
## [1] "Minor" "Major"

To put it broadly, major songs tend to sound more happy, while minor songs tend to sound more dark or sad.

mycolors <- c("#FFBF00", "#00A36C") #colors of plot chosen by my sister
music %>%
  group_by(mode, music_genre) %>% #to separate by mode, then genre
  count() %>% #how many observations per mode type
  ggplot(aes(music_genre, n, fill = mode)) + #color by mode
  geom_col(position="dodge")  + #side by side
  scale_fill_manual(values=mycolors) + #apply chosen colors
  labs(title = "Mode Distribution by Genre", x="Genre", y="count")

There seems to be a preference towards the major mode for all genres, particularly country.

I will use One Hot Encoding for mode as well:

music$mode <- as.factor(music$mode) #convert from character to factor
music <- one_hot(as.data.table(music)) #one hot encode

SPEECHINESS

music %>%
  ggplot(aes(x=music_genre, y=speechiness, fill=music_genre)) + #color by genre
  geom_boxplot() + #boxplots
  labs(title = "Speechiness by Genre", x="Genre", y="Speechiness" )

Rap and hip-hop stick out as being particularly speechy, which should be helpful for later identification. Classical and country are hardly speechy at all.

VALENCE
With regards to music, valence is a measure of the perceived “positivity” within a song. The higher the valence, the more “upbeat” the song sounds, and vice versa.

music %>%
  ggplot(aes(x=music_genre, y=valence, fill=music_genre)) + #color by genre
  geom_boxplot() + #boxplots
  labs(title = "Valence by Genre", x="Genre", y="Valence" )

Classical stands out as having lower valence on average than other genres.

EDA Final Touches

Because I extracted the lengths of both the artist and track names, I will drop the original variables and keep the derived ones, to avoid redundancy among predictors.

music <- music %>%
  select(-artist_name, -track_name) #drop artist and track names

Overall, I notice that for many of the features, most genres have very similar distributions, with the exception of one or two genres each time. This tells me it will be hard to build a model that flawlessly distinguishes between genres every time.

Thus, I will focus on building and choosing the best possible model, even if it is not perfectly accurate in nature.

MODEL PREPARATION

First, I will separate the labels (the target variable to predict) from the features (predictors):

music_features <- music %>%
  select(-music_genre) #predictors
music_labels <- music %>% 
  select(music_genre) #labels (target)

Next, I will scale the features: I will center each predictor variable around 0 and normalize it to have a standard deviation of 1. This will help keep things even across the board when building and comparing the models. I will separate the One Hot Encoded variables from the data prior to scaling, as scaling and normalizing a bunch of 1s and 0s does not make sense intuitively (like taking the average of true and false). Then, I will scale the remaining features and reattach the two data frames. Finally, I will transform genre into a factor rather than a character variable so I can actually go about making predictions.
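To make the transformation concrete, here is a small sketch on a hypothetical toy vector (not the real features): scale() subtracts each column’s mean and divides by its standard deviation, i.e. z = (x - mean(x)) / sd(x).

```r
# Toy illustration of scale() (hypothetical values, not the real features):
# each value becomes z = (x - mean(x)) / sd(x)
x <- c(10, 20, 30, 40, 50)
z <- as.numeric(scale(x))
round(mean(z), 10)  # 0: centered
sd(z)               # 1: unit standard deviation
```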

#separate numeric features and place in one data frame
to_scale <- music_features %>%
  select(popularity, acousticness, danceability, duration_ms, energy, liveness, loudness,
         speechiness, tempo, valence, Aname_length, Tname_length) 
#separate one hot encoded features and place in second data frame
features_encoded <- music_features %>%
  select(-popularity, -acousticness, -danceability, -duration_ms, -energy, -liveness,
         -loudness, -speechiness, -tempo, -valence, -Aname_length, -Tname_length)
#scale the chosen numerical features
features_scaled <- to_scale %>%
  scale() %>%
  as.data.frame()
colMeans(features_scaled) #all equal (basically) 0
##    popularity  acousticness  danceability   duration_ms        energy 
##  1.782835e-16 -3.944527e-17 -1.208429e-16 -6.445851e-17 -2.622440e-17 
##      liveness      loudness   speechiness         tempo       valence 
## -1.030032e-16  1.521437e-16 -6.332726e-17  1.176359e-16  9.334596e-17 
##  Aname_length  Tname_length 
## -1.799978e-16 -5.442050e-17
#rejoin the 2 data frames into one
features_processed <- as.data.frame(c(features_scaled, features_encoded)) 
#make genre a factor rather than character type
features_processed$music_genre <- as.factor(music_labels$music_genre)
glimpse(features_processed)
## Rows: 50,000
## Columns: 27
## $ popularity   <dbl> -1.10799194, -0.85062494, -1.04365019, -0.65759969, -0.78…
## $ acousticness <dbl> -0.8838774, -0.8603818, -0.8886234, -0.8231755, -0.883965…
## $ danceability <dbl> 0.52487283, 0.35692975, 0.34573354, 1.20784137, 0.4464993…
## $ duration_ms  <dbl> -0.222787729, -0.232101866, -0.257366933, -0.716832921, -…
## $ energy       <dbl> 1.28986301, 1.09708957, 0.58680694, 0.37891401, -0.048211…
## $ liveness     <dbl> -0.48810859, -0.43242829, 2.10411858, -0.22826720, -0.228…
## $ loudness     <dbl> 0.63812554, 0.33924462, 0.73288475, 0.75219356, 0.4653198…
## $ speechiness  <dbl> -0.185319962, -0.627251443, -0.582861004, 1.434437834, -0…
## $ tempo        <dbl> -0.655412965, -0.170024909, 0.276808622, 0.277496481, 0.8…
## $ valence      <dbl> 1.2250610, 0.3024276, -0.4988067, -0.7537449, -0.5392731,…
## $ Aname_length <dbl> -0.7959179, 1.6583842, 0.4312332, -0.7959179, -0.1823424,…
## $ Tname_length <dbl> -0.01570766, -0.24836201, -0.65550713, -0.88816149, -0.24…
## $ key_A        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ key_A.       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_B        <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ key_C        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ key_C.       <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ key_D        <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ key_D.       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_E        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_F        <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_F.       <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_G        <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ key_G.       <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ mode_Major   <int> 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, …
## $ mode_Minor   <int> 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, …
## $ music_genre  <fct> Electronic, Electronic, Electronic, Electronic, Electroni…

Now that I have the final version of the dataset, processed as necessary, I will split the data into training and test sets on which to fit my models and evaluate their predictions.

DATA SPLIT

Here, I split the data into an 80% training and 20% test set, using stratified sampling. Stratifying on genre ensures that each genre is proportionally represented in both sets, which matters because the dataset stores long consecutive runs of tracks from the same genre.
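Conceptually, stratified sampling draws the 80% training fraction within each class separately. A base-R sketch of the idea on hypothetical toy labels (the project itself uses rsample’s initial_split):

```r
set.seed(123)
# Toy labels (hypothetical, not the real data): 3 genres, 100 tracks each
labels <- rep(c("Rock", "Jazz", "Rap"), each = 100)
# Stratified 80% draw: sample within each class separately
train_idx <- unlist(lapply(split(seq_along(labels), labels),
                           function(idx) sample(idx, 0.8 * length(idx))))
table(labels[train_idx])  # 80 per genre: class balance preserved exactly
```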

set.seed(123) #to ensure replicability when randomly splitting and stratifying data
music_split <- features_processed %>% 
  initial_split(prop = 0.8, strata = "music_genre") #80/20 split with stratified sampling

music_train <- training(music_split) #80 goes to training set
music_test <- testing(music_split) #20 goes to test set
glimpse(music_train) #taking a look at training set
## Rows: 40,000
## Columns: 27
## $ popularity   <dbl> 0.24318479, 0.11450129, -0.07852396, -0.07852396, 0.43621…
## $ acousticness <dbl> 0.15414918, -0.02162863, -0.02748789, -0.86911206, -0.896…
## $ danceability <dbl> -0.757092705, 0.978319157, 0.603246271, 0.524872832, -0.8…
## $ duration_ms  <dbl> 1.75240318, -0.19916301, 0.18521483, -0.15630666, -0.1208…
## $ energy       <dbl> 0.522549122, -0.588733064, 0.530108864, 0.806039475, 0.88…
## $ liveness     <dbl> -0.7040244, -0.5561623, -0.3025076, 0.5945639, 0.1676816,…
## $ loudness     <dbl> 0.29283857, 0.31863774, 0.52340849, 0.71341368, 0.9217541…
## $ speechiness  <dbl> -0.6085088, 0.2112346, -0.5532674, -0.5187415, -0.5059176…
## $ tempo        <dbl> -0.293426903, -1.060665389, 0.170775080, 0.139443080, 2.3…
## $ valence      <dbl> -0.72946507, -0.34098785, 0.66257831, 0.41978004, 0.76374…
## $ Aname_length <dbl> 0.6357583, -0.3868676, -0.1823424, -0.5913927, -1.2049683…
## $ Tname_length <dbl> -0.19019842, -0.77183431, -0.13203484, -0.42285278, -0.30…
## $ key_A        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_A.       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_B        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ key_C        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
## $ key_C.       <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, …
## $ key_D        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ key_D.       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_E        <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_F        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_F.       <int> 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_G        <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_G.       <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ mode_Major   <int> 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, …
## $ mode_Minor   <int> 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, …
## $ music_genre  <fct> Alternative, Alternative, Alternative, Alternative, Alter…

While training my models using repeated cross validation, I learned that it was quite a lengthy process on my machine. Thus, I decided to write my models to external .rda files, so that I could later load them back in without rerunning the cross validation, and without forcing knitr to rerun it every time I tried to knit my project (one knit attempt even took over 20 hours before I decided to do this). Quickly grabbing my current working directory so I know where to write the files:

getwd()
## [1] "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject"

and now onto actually training and building the models. Exciting!

MODEL BUILDING

For my project, because I am working to predict categorical data, I decided to build the following classification models using repeated cross validation. I did so primarily using the caret package, so that the folding of the training data could be handled within model training via the trainControl() function:

  1. Random Forest
  2. Boosted Trees
  3. k-NN (k Nearest Neighbors)
  4. SVM (Support Vector Machine)

Random Forest

Over the many times I had to run this process, it took an average of two hours each time; originally, I attempted to train and tune the model with repeated cross validation, but the runtime was simply too taxing on my machine (over 4 hours, and counting) for me to justify keeping it in. Thus, I trained this model using only 10-fold cross validation with no repeats.

I chose a maximum of 25 for my tuning grid for the mtry parameter because my training set contains 26 predictors.
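For context (a side note, not the project’s code): randomForest’s default mtry for classification is floor(sqrt(p)), so with p = 26 predictors the grid of 2 through 25 comfortably brackets the default.

```r
# Side note on the mtry grid: the usual default for classification is
# floor(sqrt(p)), where p is the number of predictors.
p <- 26          # predictors in the training set
floor(sqrt(p))   # 5: the usual default mtry
length(2:25)     # 24 candidate values tried during tuning
```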

After training the model and saving it to an external .rda file, I commented the code out completely, as setting cache=TRUE in the R chunk header was being ignored by knitr for some reason.

#rf.fitControl <- trainControl(method="cv", number=10) #10 fold cross validation
#tunegrid <- expand.grid(.mtry=c(2:25)) #mtry values between 2 and 25
#rf_music2 <- train(music_genre ~., data=music_train, method="rf", 
#               tuneGrid=tunegrid, ntree = 100, trControl=rf.fitControl)
#rf_music2
#save(rf_music2, file = "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/rfModel.rda")

The tuning process selected mtry = 7 as optimal, with an accuracy of about 55%. Now to load the model back in from the external .rda file so I can actually examine it and use it to predict on the test data.

load("rfModel.rda") #load random forest model from external file

I plotted the performance of the random forest model to see the progression of the tuning process:

ggplot(rf_music2) #performance plot of random forest

It seems performance of the model rapidly increased until mtry=7, then slowly tapered off as the mtry parameter value increased. Interestingly, an mtry value of 11 came in a close second to the optimal value.

Now, to take the random forest model fit at the optimal mtry and predict on the test set. I created a confusion matrix to see the ratio of correct to incorrect predictions by the model:

rf.music <- predict(rf_music2, music_test) #predicting on test set
#building confusion matrix to compare predictions to actual test data
confusionMatrix(reference = music_test$music_genre, data = rf.music, mode='everything')
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Alternative Anime Blues Classical Country Electronic Hip-Hop Jazz
##   Alternative         377    28    52        24      63         74      35   29
##   Anime                 8   775    76        35      10         49       1   17
##   Blues                18    45   487        28      42         73       2  136
##   Classical             3    50    11       851       0          5       0   42
##   Country              94    18    82         9     550         29       5   54
##   Electronic           49    52    58        13      38        589      14  113
##   Hip-Hop              95     0     6         0      11         25     399   36
##   Jazz                 83    22   173        34      78        107      15  520
##   Rap                  59     0     1         0      12         19     483   13
##   Rock                214    10    54         6     196         30      46   40
##              Reference
## Prediction    Rap Rock
##   Alternative  42  161
##   Anime         2    5
##   Blues         0   15
##   Classical     0    2
##   Country       5  100
##   Electronic    5    1
##   Hip-Hop     559   43
##   Jazz          3   16
##   Rap         291   64
##   Rock         93  593
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5432         
##                  95% CI : (0.5334, 0.553)
##     No Information Rate : 0.1            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4924         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Alternative Class: Anime Class: Blues
## Sensitivity                      0.3770       0.7750       0.4870
## Specificity                      0.9436       0.9774       0.9601
## Pos Pred Value                   0.4260       0.7924       0.5757
## Neg Pred Value                   0.9317       0.9751       0.9440
## Precision                        0.4260       0.7924       0.5757
## Recall                           0.3770       0.7750       0.4870
## F1                               0.4000       0.7836       0.5276
## Prevalence                       0.1000       0.1000       0.1000
## Detection Rate                   0.0377       0.0775       0.0487
## Detection Prevalence             0.0885       0.0978       0.0846
## Balanced Accuracy                0.6603       0.8762       0.7236
##                      Class: Classical Class: Country Class: Electronic
## Sensitivity                    0.8510         0.5500            0.5890
## Specificity                    0.9874         0.9560            0.9619
## Pos Pred Value                 0.8828         0.5814            0.6320
## Neg Pred Value                 0.9835         0.9503            0.9547
## Precision                      0.8828         0.5814            0.6320
## Recall                         0.8510         0.5500            0.5890
## F1                             0.8666         0.5653            0.6097
## Prevalence                     0.1000         0.1000            0.1000
## Detection Rate                 0.0851         0.0550            0.0589
## Detection Prevalence           0.0964         0.0946            0.0932
## Balanced Accuracy              0.9192         0.7530            0.7754
##                      Class: Hip-Hop Class: Jazz Class: Rap Class: Rock
## Sensitivity                  0.3990      0.5200     0.2910      0.5930
## Specificity                  0.9139      0.9410     0.9277      0.9234
## Pos Pred Value               0.3399      0.4948     0.3089      0.4626
## Neg Pred Value               0.9319      0.9464     0.9217      0.9533
## Precision                    0.3399      0.4948     0.3089      0.4626
## Recall                       0.3990      0.5200     0.2910      0.5930
## F1                           0.3671      0.5071     0.2997      0.5197
## Prevalence                   0.1000      0.1000     0.1000      0.1000
## Detection Rate               0.0399      0.0520     0.0291      0.0593
## Detection Prevalence         0.1174      0.1051     0.0942      0.1282
## Balanced Accuracy            0.6564      0.7305     0.6093      0.7582

The accuracy of the mtry=7 random forest model is about 54% on the test data, almost exactly matching the training accuracy (55%). I believe this indicates that the model did not overfit to the training set, which I was slightly worried about because of the 80/20 ratio I used for the training/test split.

From the confusion matrix, I see that the model often confused hip-hop for rap, and rap for hip-hop, which makes sense given the similarity of the genres. This makes me believe the models I fit might perform better if I combined the two genres, but I wanted to keep the data as pure as possible for this project, to test how well specific genre distinctions can be learned.
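As a hedged sketch of that possible next step (not run on the real data), merging two factor levels in base R takes only a reassignment of levels; `genres` below is a toy stand-in for the `music_genre` column:

```r
# Toy stand-in for music_genre; in the real data this would be music$music_genre.
genres <- factor(c("Hip-Hop", "Rap", "Rock", "Country", "Rap"))

# Assigning the same label to both levels merges them into a single level.
levels(genres)[levels(genres) %in% c("Hip-Hop", "Rap")] <- "Hip-Hop/Rap"

table(genres)  # Country: 1, Hip-Hop/Rap: 3, Rock: 1
```

The same reassignment applied before the train/test split would collapse the genres everywhere at once, keeping the two sets consistent.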

Rock/alternative and rock/country also overlapped noticeably in the confusion matrix.

Now, to take a look at which variables were most important to the model:

varImpRF<-varImp(rf_music2) #order of variable importance to random forest model
ggplot(varImpRF, main="Variable Importance with Random Forest") #plot importance

Popularity was by far the most important feature for predicting genre, along with loudness, speechiness, and danceability (the latter two, I assume, for distinguishing rap and hip-hop). Key and mode appear least important, but this undersells key in particular, since its importance was split 12 ways when it was One Hot Encoded earlier. Therefore, I will take mode and liveness to be the true least important variables.
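One way to put key back on a level footing would be to sum the importance of its 12 dummies into a single score. A minimal sketch with a toy importance table (in the real project the table would come from `varImp(rf_music2)$importance`, and the `key_` prefix on the dummy names is an assumption):

```r
# Toy importance table; row names mimic one-hot dummy naming (assumed pattern).
imp <- data.frame(Overall = c(10, 3, 2, 4, 1),
                  row.names = c("popularity", "key_A", "key_B", "key_C", "mode_Major"))

# Sum all key_* dummy importances into one combined score for "key".
key_total <- sum(imp$Overall[grepl("^key_", rownames(imp))])
key_total  # 9
```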

Because popularity is so crucial to determining genre, I am led to believe that genre is less an inherent characteristic of music and more a phenomenon of how humans interact with music and attempt to find patterns in it. More on this later…

Boosted Trees

Next, I decided to fit a boosted tree model to the training set, training it with 10-fold cross validation repeated three times. Even so, training and tuning took just under 3 hours, so once again I wrote the model to an external file to keep things quick and simple and spare my machine from repeat runtimes.

#gbmFitControl <- trainControl(method = "repeatedcv", #10-fold CV
#                              number = 10,
#                              repeats = 3) #repeated three times
#gbmFit1 <- train(music_genre ~ ., data = music_train,
#                 method = "gbm",
#                 trControl = gbmFitControl)
#gbmFit1
#save(gbmFit1, file = "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/boostModel.rda")

The results of the optimal tuned parameters are as follows:

150: Optimal number of trees (number of boosting iterations)
3: Optimal interaction depth (complexity of the tree)
0.1: Optimal learning rate (how fast the algorithm adapts, \(\lambda\))
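If I ever refit this model, caret's search could be skipped by pinning the grid to these tuned values. A sketch, with the caveat that `n.minobsinnode = 10` is caret's gbm default and was not tuned here:

```r
# Pin the tuning grid to the values reported above so train() fits only once.
# n.minobsinnode is left at caret's default of 10 (assumption; it was not tuned).
gbmGrid <- expand.grid(n.trees = 150,
                       interaction.depth = 3,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
nrow(gbmGrid)  # 1: a single candidate, so no search
```

Passing `tuneGrid = gbmGrid` to the earlier `train()` call would refit the boosted model directly at the chosen settings.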

Now to load the boost model from the external file:

load("boostModel.rda") #loading boost model from external file

Plotting the performance of the boosted model:

ggplot(gbmFit1)

I assessed that the gains in accuracy would continue to taper off as the number of boosting iterations increased, so I stuck with the chosen tuned value of 150. Now onto predicting on the test set and building another confusion matrix:

boost.music <- predict(gbmFit1, music_test) #predicting on test data
#building confusion matrix to compare predictions to actual test data
confusionMatrix(reference = music_test$music_genre, data = boost.music, mode='everything')
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Alternative Anime Blues Classical Country Electronic Hip-Hop Jazz
##   Alternative         411    30    47        24      72         81      36   44
##   Anime                 4   752    75        23       8         49       1   13
##   Blues                14    62   486        30      32         77       2  130
##   Classical             2    55    17       852       0          3       0   39
##   Country             101    18    90        13     567         33       9   59
##   Electronic           51    58    49        17      36        595      13   97
##   Hip-Hop              82     0     5         0      14         24     453   32
##   Jazz                 73    16   166        35      87         87      16  532
##   Rap                  58     2     4         0       9         21     408   15
##   Rock                204     7    61         6     175         30      62   39
##              Reference
## Prediction    Rap Rock
##   Alternative  48   92
##   Anime         2    3
##   Blues         0    2
##   Classical     0    3
##   Country       2   59
##   Electronic    3    3
##   Hip-Hop     411   42
##   Jazz         12   13
##   Rap         418   54
##   Rock        104  729
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5795          
##                  95% CI : (0.5698, 0.5892)
##     No Information Rate : 0.1             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5328          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Alternative Class: Anime Class: Blues
## Sensitivity                      0.4110       0.7520       0.4860
## Specificity                      0.9473       0.9802       0.9612
## Pos Pred Value                   0.4644       0.8086       0.5820
## Neg Pred Value                   0.9354       0.9727       0.9439
## Precision                        0.4644       0.8086       0.5820
## Recall                           0.4110       0.7520       0.4860
## F1                               0.4361       0.7793       0.5297
## Prevalence                       0.1000       0.1000       0.1000
## Detection Rate                   0.0411       0.0752       0.0486
## Detection Prevalence             0.0885       0.0930       0.0835
## Balanced Accuracy                0.6792       0.8661       0.7236
##                      Class: Classical Class: Country Class: Electronic
## Sensitivity                    0.8520         0.5670            0.5950
## Specificity                    0.9868         0.9573            0.9637
## Pos Pred Value                 0.8774         0.5962            0.6453
## Neg Pred Value                 0.9836         0.9521            0.9554
## Precision                      0.8774         0.5962            0.6453
## Recall                         0.8520         0.5670            0.5950
## F1                             0.8645         0.5812            0.6191
## Prevalence                     0.1000         0.1000            0.1000
## Detection Rate                 0.0852         0.0567            0.0595
## Detection Prevalence           0.0971         0.0951            0.0922
## Balanced Accuracy              0.9194         0.7622            0.7793
##                      Class: Hip-Hop Class: Jazz Class: Rap Class: Rock
## Sensitivity                  0.4530      0.5320     0.4180      0.7290
## Specificity                  0.9322      0.9439     0.9366      0.9236
## Pos Pred Value               0.4262      0.5130     0.4226      0.5145
## Neg Pred Value               0.9388      0.9478     0.9354      0.9684
## Precision                    0.4262      0.5130     0.4226      0.5145
## Recall                       0.4530      0.5320     0.4180      0.7290
## F1                           0.4392      0.5223     0.4203      0.6032
## Prevalence                   0.1000      0.1000     0.1000      0.1000
## Detection Rate               0.0453      0.0532     0.0418      0.0729
## Detection Prevalence         0.1063      0.1037     0.0989      0.1417
## Balanced Accuracy            0.6926      0.7379     0.6773      0.8263

Here we see the boost model is about 58% accurate, which is an improvement, however small, over the previously trained random forest model. Once again, hip-hop and rap were greatly confused for each other, alongside country/rock, jazz/blues, and alternative/rock. Looking at variable importance once more:

varImpBoost<-varImp(gbmFit1) #ordered variable importance
ggplot(varImpBoost, main="Variable Importance with BOOST") #plotting importance

In the boosted model, popularity, loudness, speechiness, and danceability are the most important predictors, just as in the random forest model. Once again, liveness and mode are the least important features, although now they, along with tempo, duration, and energy, carry far less weight than in the random forest model. That the boosted model is pickier about its predictors is arguably a good sign for its robustness.

k-NN

Training and tuning the k nearest neighbors model only took around 40 minutes. Again, I used 10-fold cross validation repeated three times, and wrote the model to an external file.

#knnFitControl <- trainControl(method = "repeatedcv", #determine k, the best number of neighbors
#                              number = 10, #10 folds
#                              repeats = 3) #repeated three times
#knnFit1 <- train(music_genre ~ ., data = music_train,
#                 method = "knn",
#                 trControl = knnFitControl)
#knnFit1
#save(knnFit1, file = "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/knnModel.rda")

The only parameter being tuned in this model is of course k, or the ideal number of neighbors. Loading it in:

load("knnModel.rda")

and plotting the performance:

ggplot(knnFit1)

The accuracy looks like it would have continued to increase roughly linearly as the number of neighbors increased, and I’m really not sure why that is. I chose to stick with the tuned value chosen by the model to avoid overcomplicating things.
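To test that hunch, a wider grid of k values could be handed to caret; a sketch (the range is my own choice, not from the project):

```r
# A wider, odd-valued grid of neighbor counts; passing this as tuneGrid = knnGrid
# to the earlier train() call would extend the search beyond caret's default k values.
knnGrid <- expand.grid(k = seq(5, 51, by = 2))
nrow(knnGrid)  # 24 candidate values of k
```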

Once again, onto prediction and confusion matrix.

knn.music <- predict(knnFit1, music_test)
confusionMatrix(reference = music_test$music_genre, data = knn.music, mode="everything")
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Alternative Anime Blues Classical Country Electronic Hip-Hop Jazz
##   Alternative         299    35    64        15      97        102      62   70
##   Anime                 8   669    93        37       9         69       0   22
##   Blues                35    86   403        23      52         60       1  152
##   Classical             1    60    15       847       3          8       0   54
##   Country             166    54   157        27     575         74      30  110
##   Electronic           65    58    47        14      23        492      24   79
##   Hip-Hop             110     2     9         1      23         50     406   43
##   Jazz                 52    28   160        34      58         92      12  427
##   Rap                  76     1     2         0      24         24     404   22
##   Rock                188     7    50         2     136         29      61   21
##              Reference
## Prediction    Rap Rock
##   Alternative  62  168
##   Anime         0    4
##   Blues         3   21
##   Classical     0    6
##   Country      26  170
##   Electronic   13   20
##   Hip-Hop     445   46
##   Jazz         14   26
##   Rap         325   49
##   Rock        112  490
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4933          
##                  95% CI : (0.4835, 0.5031)
##     No Information Rate : 0.1             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.437           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Alternative Class: Anime Class: Blues
## Sensitivity                      0.2990       0.6690       0.4030
## Specificity                      0.9250       0.9731       0.9519
## Pos Pred Value                   0.3070       0.7344       0.4821
## Neg Pred Value                   0.9223       0.9636       0.9349
## Precision                        0.3070       0.7344       0.4821
## Recall                           0.2990       0.6690       0.4030
## F1                               0.3029       0.7002       0.4390
## Prevalence                       0.1000       0.1000       0.1000
## Detection Rate                   0.0299       0.0669       0.0403
## Detection Prevalence             0.0974       0.0911       0.0836
## Balanced Accuracy                0.6120       0.8211       0.6774
##                      Class: Classical Class: Country Class: Electronic
## Sensitivity                    0.8470         0.5750            0.4920
## Specificity                    0.9837         0.9096            0.9619
## Pos Pred Value                 0.8521         0.4140            0.5892
## Neg Pred Value                 0.9830         0.9506            0.9446
## Precision                      0.8521         0.4140            0.5892
## Recall                         0.8470         0.5750            0.4920
## F1                             0.8495         0.4814            0.5362
## Prevalence                     0.1000         0.1000            0.1000
## Detection Rate                 0.0847         0.0575            0.0492
## Detection Prevalence           0.0994         0.1389            0.0835
## Balanced Accuracy              0.9153         0.7423            0.7269
##                      Class: Hip-Hop Class: Jazz Class: Rap Class: Rock
## Sensitivity                  0.4060      0.4270     0.3250      0.4900
## Specificity                  0.9190      0.9471     0.9331      0.9327
## Pos Pred Value               0.3577      0.4729     0.3506      0.4471
## Neg Pred Value               0.9330      0.9370     0.9256      0.9427
## Precision                    0.3577      0.4729     0.3506      0.4471
## Recall                       0.4060      0.4270     0.3250      0.4900
## F1                           0.3803      0.4488     0.3373      0.4676
## Prevalence                   0.1000      0.1000     0.1000      0.1000
## Detection Rate               0.0406      0.0427     0.0325      0.0490
## Detection Prevalence         0.1135      0.0903     0.0927      0.1096
## Balanced Accuracy            0.6625      0.6871     0.6291      0.7113

The accuracy of the k-NN model with k=9 comes in at 49%. This model really seemed to confuse hip-hop for rap and vice versa, as with all my previous models. In fact, it confused more genre pairs than any of my other models, like hip-hop/alternative, jazz/country, and country/blues. Overall, it was most sensitive to classical music, which makes sense, as classical was continuously differentiated from the other genres in the EDA.

varImpKNN<-varImp(knnFit1) #order of variable importance
ggplot(varImpKNN, main="Variable Importance with k-NN") #plotting variable importance

A bit hard to see, but popularity is definitely the most important feature across genres, except for rock, which benefits slightly more from loudness. Interestingly, loudness is also important for identifying anime and electronic tracks, but far less so for genres like hip-hop, rap, jazz, blues, classical, and country.

SVM

I chose not to conduct PCA and simply trained the SVM model on the data as is. My data mixes continuous and categorical (key and mode) predictors, and upon researching (and finding conflicting opinions), I learned that PCA is typically not useful when the data contains One Hot Encoded variables (all 0s and 1s), as they can skew the weights of the principal components.
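Even without PCA, an SVM generally benefits from standardized inputs. A minimal sketch of centering and scaling only the continuous columns while leaving the one-hot dummies at 0/1 (toy data with hypothetical column names; the real call would target the numeric columns of music_train):

```r
# Toy frame mixing continuous features with one-hot key dummies (hypothetical names).
df <- data.frame(loudness = c(-5, -10, -2),
                 tempo    = c(120, 90, 140),
                 key_A    = c(1, 0, 0),
                 key_B    = c(0, 1, 1))

cont <- c("loudness", "tempo")  # continuous columns only
df[cont] <- scale(df[cont])     # center to mean 0, scale to sd 1

round(colMeans(df[cont]), 10)   # both means are now 0; dummies untouched
```

In caret, the equivalent shortcut would be `preProcess = c("center", "scale")` inside `train()`, though that standardizes the dummy columns too.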

Training and tuning this model took a bit over an hour, so I wrote it to an external file as before:

#svmFitControl <- trainControl(## 10-fold CV
#                           method = "repeatedcv",
#                           number = 10,
#                           ## repeated three times
#                           repeats = 3)
#svmFit1 <- train(music_genre ~ ., data = music_train, 
#                 method = 'svmLinear', #chose a linear kernel for classification
#                 trControl = svmFitControl)
#svmFit1
#save(svmFit1, file = "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/svmModel.rda")

There was only one parameter to tune here (C, as the kernel was linear).

Loading in the final model:

load("svmModel.rda")
svm.music <- predict(svmFit1, music_test)
confusionMatrix(reference = music_test$music_genre, data = svm.music, mode='everything')
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Alternative Anime Blues Classical Country Electronic Hip-Hop Jazz
##   Alternative         317    29    32        15      69         70      74   36
##   Anime                 5   679   160        57      19         83       1   41
##   Blues                17    77   406        25      76         66       1  130
##   Classical             5    82    11       824       1          8       0   54
##   Country             168    24    86        13     490         41      17   78
##   Electronic           97    79    66        30      56        574      25  104
##   Hip-Hop              94     1     1         0      14         40     517   35
##   Jazz                 80    23   187        29      92         79      17  482
##   Rap                  47     1     1         0       5         18     286    6
##   Rock                170     5    50         7     178         21      62   34
##              Reference
## Prediction    Rap Rock
##   Alternative  77  115
##   Anime         2    3
##   Blues         0    3
##   Classical     0    2
##   Country      22   63
##   Electronic   10   10
##   Hip-Hop     424   26
##   Jazz         13   24
##   Rap         344   65
##   Rock        108  689
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5322         
##                  95% CI : (0.5224, 0.542)
##     No Information Rate : 0.1            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4802         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Alternative Class: Anime Class: Blues
## Sensitivity                      0.3170       0.6790       0.4060
## Specificity                      0.9426       0.9588       0.9561
## Pos Pred Value                   0.3801       0.6467       0.5069
## Neg Pred Value                   0.9255       0.9641       0.9354
## Precision                        0.3801       0.6467       0.5069
## Recall                           0.3170       0.6790       0.4060
## F1                               0.3457       0.6624       0.4509
## Prevalence                       0.1000       0.1000       0.1000
## Detection Rate                   0.0317       0.0679       0.0406
## Detection Prevalence             0.0834       0.1050       0.0801
## Balanced Accuracy                0.6298       0.8189       0.6811
##                      Class: Classical Class: Country Class: Electronic
## Sensitivity                    0.8240         0.4900            0.5740
## Specificity                    0.9819         0.9431            0.9470
## Pos Pred Value                 0.8349         0.4890            0.5461
## Neg Pred Value                 0.9805         0.9433            0.9524
## Precision                      0.8349         0.4890            0.5461
## Recall                         0.8240         0.4900            0.5740
## F1                             0.8294         0.4895            0.5597
## Prevalence                     0.1000         0.1000            0.1000
## Detection Rate                 0.0824         0.0490            0.0574
## Detection Prevalence           0.0987         0.1002            0.1051
## Balanced Accuracy              0.9029         0.7166            0.7605
##                      Class: Hip-Hop Class: Jazz Class: Rap Class: Rock
## Sensitivity                  0.5170      0.4820     0.3440      0.6890
## Specificity                  0.9294      0.9396     0.9523      0.9294
## Pos Pred Value               0.4488      0.4698     0.4450      0.5204
## Neg Pred Value               0.9454      0.9423     0.9289      0.9642
## Precision                    0.4488      0.4698     0.4450      0.5204
## Recall                       0.5170      0.4820     0.3440      0.6890
## F1                           0.4805      0.4758     0.3880      0.5929
## Prevalence                   0.1000      0.1000     0.1000      0.1000
## Detection Rate               0.0517      0.0482     0.0344      0.0689
## Detection Prevalence         0.1152      0.1026     0.0773      0.1324
## Balanced Accuracy            0.7232      0.7108     0.6482      0.8092

The SVM model has an accuracy of about 53%, almost as accurate as the random forest model. Let us check out which variables are most important for SVM:

varImpSVM<-varImp(svmFit1)
ggplot(varImpSVM, main="Variable Importance with Support Vector Machines (SVM)")

Similarly to k-NN, rock favored loudness slightly, but popularity beat out all other predictors in terms of importance between genres.

Now to compare each model’s performance against the others:

MODEL PERFORMANCE

# Compare model performances using resamples()
# Could not include the random forest model, as it used plain (not repeated) CV,
# resulting in a different number of resamples than these three models.
models_compare <- resamples(list(BOOST=gbmFit1, KNN=knnFit1, SVM=svmFit1))
# Summary of the models performances
summary(models_compare)
## 
## Call:
## summary.resamples(object = models_compare)
## 
## Models: BOOST, KNN, SVM 
## Number of resamples: 30 
## 
## Accuracy 
##          Min.   1st Qu.   Median      Mean   3rd Qu.    Max. NA's
## BOOST 0.56050 0.5736250 0.578125 0.5783417 0.5830625 0.59075    0
## KNN   0.46675 0.4842500 0.489375 0.4874000 0.4921875 0.49775    0
## SVM   0.51775 0.5313125 0.535250 0.5353917 0.5392500 0.55125    0
## 
## Kappa 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## BOOST 0.5116667 0.5262500 0.5312500 0.5314907 0.5367361 0.5452778    0
## KNN   0.4075000 0.4269444 0.4326389 0.4304444 0.4357639 0.4419444    0
## SVM   0.4641667 0.4792361 0.4836111 0.4837685 0.4880556 0.5013889    0

Plotting the above comparisons for visual ease:

scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)

Thus, the boosted model is the most accurate of the three models plotted here, and when comparing the test accuracy of the boosted model (58%) to that of the random forest model (54%), the boosted model still comes out ahead.
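Because the random forest can’t join the resamples() comparison, a hedged fallback is comparing plain held-out test accuracy for all four models. A sketch with toy vectors standing in for the prediction objects above:

```r
# Share of predictions matching the truth; works on factors or character vectors.
test_acc <- function(pred, truth) mean(as.character(pred) == as.character(truth))

# Toy stand-ins; the real call would be e.g. test_acc(boost.music, music_test$music_genre).
truth <- c("Rock", "Rap", "Jazz", "Rock")
pred  <- c("Rock", "Rap", "Blues", "Rock")
test_acc(pred, truth)  # 0.75
```

This is the same "Accuracy" figure confusionMatrix() reports, just computed directly, so it puts all four models on one scale.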

CHOSEN MODEL BASED ON ACCURACY: BOOST

FINAL MODEL BUILDING

final_model <- gbmFit1 #chosen model is boosted model

Checking a few predictions

predict(final_model, newdata = head(music_test)) #predicted values
## [1] Electronic  Blues       Electronic  Electronic  Anime       Alternative
## 10 Levels: Alternative Anime Blues Classical Country Electronic ... Rock
head(music_test$music_genre) #actual test values
## [1] Electronic Electronic Electronic Electronic Electronic Electronic
## 10 Levels: Alternative Anime Blues Classical Country Electronic ... Rock

It seems even the best model correctly predicted only 3 of the first 6 entries of the test set (~50% accuracy).

predict(final_model, newdata = tail(music_test)) 
## [1] Hip-Hop Hip-Hop Rap     Rap     Rap     Hip-Hop
## 10 Levels: Alternative Anime Blues Classical Country Electronic ... Rock
tail(music_test$music_genre)
## [1] Hip-Hop Hip-Hop Hip-Hop Hip-Hop Hip-Hop Hip-Hop
## 10 Levels: Alternative Anime Blues Classical Country Electronic ... Rock

Of the last 6 entries of the test set, the model once again correctly predicted the genre 3 out of 6 times.

CONCLUSION

Overall, of the four models I built and trained on the final cleaned and transformed dataset, the boosted tree model performed best based on accuracy of predictions on the test set, but not by much. It seems that regardless of the model I chose to train and test, the accuracy never seemed to cross the threshold of 60%. This led me to a few different conclusions.

First, I do not believe any of my models were particularly effective because it seems that among the genres present in the dataset, quite a few shared too many similarities to be properly separated by the model. In particular, rap and hip-hop, rock and country, and jazz and blues seemed to always be lumped together. This is not necessarily surprising, as I personally would have a difficult time distinguishing between some of these genres depending on the track. I do believe some of these models would have performed much better differentiating between genres if certain combinations of genres were lumped together. That is certainly an idea for the next steps I could take if I were to continue this analysis of genre. Perhaps if I grouped certain genres together, I could build a model that was more accurate, or I could choose to build a model that was focused only on being incredibly good at identifying whether or not a track fell under a specific genre, like classical.

This leads me to my second conclusion. I noticed that regardless of which model I looked at, classical was always significantly more distinguishable than the other genres. The boosted model was also fairly sensitive to anime and rock, even though their spreads looked largely indistinguishable from the other genres in my EDA. This was one of the more surprising findings of my analysis.

Finally, I am able to conclude that while certain inherent characteristics of music do lead to the specification of a genre, the more specific a genre’s definition gets, the more difficult it becomes to categorize music by genre alone. You could theoretically come up with thousands of sub-genres underneath any single genre, as record companies often do, but genre wouldn’t be nearly as useful a classification tool at that point. While this model was interesting to examine, and showed me just how complex and varied the nature of music is, it was not especially useful for classifying songs. Music as a whole is much more enjoyable when we don’t give too much thought to genre, anyways.

Thanks for a great quarter!